ABSTRACT

This correspondence discusses the parallel and pipeline organization
of fast unitary transform algorithms such as the Fast Fourier Transform
and points out the efficiency of a combined parallel-pipeline processor
for a transform such as the Haar transform, in which (2^n - 1) hardware
"butterflies" generate a transform of order 2^n every computation cycle.

Algorithms for all fast unitary transforms, such as the Fast Fourier
transform (FFT), the fast Walsh-Hadamard transform (FWT) and other fast
unitary transforms [1], require n stages of computation for transforms of
order 2^n. Each stage of computation can in turn be decomposed into at
most 2^(n-1) "butterflies" [2], each performing a rotation by a matrix of
order 2. Some or all of the butterflies at one stage of computation can
operate in parallel (see [3], [4] for the FFT), and fast unitary transforms
thus have greater potential in applications with the development of
low-cost parallel circuitry. For example, we show in Fig. 1a the FFT
Cooley-Tukey algorithm of order 4 with 2 butterflies in each of its 2 
stages of computation. If t seconds is the time required to perform a 
butterfly operation, each stage can be performed in t seconds with the
highest possible degree of parallelism, which uses 2^(n-1) butterflies. Thus,
a transform of order 2^n can be performed in nt seconds, as compared to
n2^(n-1)t seconds with sequential computation (which requires only one
butterfly).
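The stage structure and the two timing figures above can be illustrated with a minimal Python sketch. It uses the Walsh-Hadamard butterfly (a, b) -> (a + b, a - b) because it needs no twiddle factors; the function names `fwht_stages` and `timings` are illustrative assumptions, and the transform is left unnormalized (dividing each butterfly output by the square root of 2 would make it unitary).

```python
def fwht_stages(x):
    """Unnormalized fast Walsh-Hadamard transform, computed stage by stage.

    Returns the transformed list and the number of butterflies used in
    each stage. All butterflies within one stage touch disjoint pairs,
    so they could run in parallel on separate hardware.
    """
    size = len(x)
    assert size >= 2 and size & (size - 1) == 0, "length must be 2^n, n >= 1"
    data = list(x)
    butterflies_per_stage = []
    span = 1
    while span < size:
        count = 0
        for start in range(0, size, 2 * span):
            for i in range(start, start + span):
                a, b = data[i], data[i + span]
                data[i], data[i + span] = a + b, a - b  # one butterfly
                count += 1
        butterflies_per_stage.append(count)
        span *= 2
    return data, butterflies_per_stage


def timings(n, t=1.0):
    """Time for one transform of order 2^n, given butterfly time t."""
    parallel = n * t                    # 2^(n-1) butterflies, n stages
    sequential = n * 2 ** (n - 1) * t   # one butterfly reused for every operation
    return parallel, sequential


if __name__ == "__main__":
    y, counts = fwht_stages([1, 1, 1, 1])
    print(y, counts)      # [4, 0, 0, 0] [2, 2]
    print(timings(3))     # (3.0, 12.0)
```

For order 4 the sketch reports 2 butterflies in each of 2 stages, which matches the count quoted for Fig. 1a, although the FFT itself would also need complex twiddle factors.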

If a number of successive transforms have to be computed, it is 
possible to increase further the throughput rate with several transformers 
working simultaneously, each operating on a different input vector and 
each possibly at a different stage of computation (see [5] for FFT): 
this is generally referred to as a pipeline organization. Parallel and 
pipeline organizations can be combined conveniently with n2^(n-1) (at most)
butterflies working in parallel, and one transform of order 2^n is obtained
every t seconds on the average. Fig. 1b shows a possible organization of
the FFT Cooley-Tukey algorithm of order 4. All stages of this pipeline 
algorithm are identical: the first 2 butterflies perform the first stage
of Fig. 1a and the last 2 butterflies perform the second stage. The
input vector is entered in the first 4 cells and its FFT is
obtained in the same cells after 2 cycles. This algorithm can be wired
in and will give the transform coefficients in any order, but it requires
a large amount of hardware and access at its input to two
sets of n2^n storage cells.*
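The throughput claim can be checked with a small simulation: a hedged Python sketch of a two-stage pipeline for the FFT of order 4, in which every stage fires once per cycle on a different input vector. The helper names (`stage1`, `stage2`, `run_pipeline`, `bit_reverse4`, `dft`) and the choice of a decimation-in-time layout with bit-reversed input are assumptions made for illustration; the sketch does not reproduce the identical-stage wiring or the storage-cell arrangement of Fig. 1b.

```python
import cmath

W = cmath.exp(-2j * cmath.pi / 4)  # twiddle factor for order 4 (equals -1j)


def stage1(v):
    # Two butterflies on pairs (0,1) and (2,3); input is in bit-reversed order.
    return [v[0] + v[1], v[0] - v[1], v[2] + v[3], v[2] - v[3]]


def stage2(v):
    # Two butterflies on pairs (0,2) and (1,3); the second uses twiddle W.
    return [v[0] + v[2], v[1] + W * v[3], v[0] - v[2], v[1] - W * v[3]]


def run_pipeline(stages, vectors):
    """Feed one vector per cycle; all stages work simultaneously.

    Returns a list of (cycle, output_vector) pairs.
    """
    pending = list(vectors)
    between = [None] * (len(stages) - 1)  # registers between stages
    finished = []
    cycle = 0
    while pending or any(v is not None for v in between):
        cycle += 1
        incoming = pending.pop(0) if pending else None
        inflight = [incoming] + between
        results = [f(v) if v is not None else None
                   for f, v in zip(stages, inflight)]
        if results[-1] is not None:
            finished.append((cycle, results[-1]))
        between = results[:-1]
    return finished


def bit_reverse4(x):
    return [x[0], x[2], x[1], x[3]]


def dft(x):
    # Direct O(N^2) discrete Fourier transform, used only as a reference.
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


if __name__ == "__main__":
    inputs = [[1, 2, 3, 4], [0, 1, 0, -1], [1, 1, 1, 1]]
    for cycle, y in run_pipeline([stage1, stage2],
                                 [bit_reverse4(x) for x in inputs]):
        print(cycle, y)
```

With three input vectors the outputs appear at cycles 2, 3 and 4: the first transform takes n = 2 cycles, and each later one follows one cycle after the previous.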

Some transforms, however, do not require 2^(n-1) butterflies at each
stage of computation, and a pipeline algorithm can then be implemented with
much less hardware. We now consider in particular a pipeline algorithm
for the Fast Haar Transform (FHT). Although less well known, the FHT is
closely related to the FWT [6], has a fast algorithm [7], and is certainly a
transform of interest for signal encoding [8], [9] and other applications
[10]. A pipeline-parallel algorithm for the FHT requires only
(2^n - 1) butterflies and still produces a transform of order 2^n at every
cycle. We show in Fig. 2a the Haar matrix of order 8 and in Fig. 2b a
possible organization of the FHT of the same order. The number of
butterflies decreases for successive stages, and this is the property that can
be exploited in a pipeline processor. In Fig. 3, we show a stage of a
possible organization of the pipeline FHT of order 8.
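The shrinking butterfly count can be seen in a minimal Python sketch of a fast Haar transform. It assumes the orthonormal butterfly (a, b) -> ((a + b)/sqrt(2), (a - b)/sqrt(2)) and a coarse-to-fine output ordering; the function name `fast_haar` is illustrative, and the normalization and ordering of Figs. 2a and 2b may differ.

```python
import math


def fast_haar(x):
    """Orthonormal fast Haar transform of a list of length 2^n.

    Returns (coefficients, butterflies_per_stage). Each stage applies
    butterflies only to the sums produced by the previous stage, so the
    number of butterflies halves from one stage to the next.
    """
    size = len(x)
    assert size >= 1 and size & (size - 1) == 0, "length must be a power of 2"
    r = 1.0 / math.sqrt(2.0)
    sums = list(x)
    details = []            # difference outputs, coarsest level first
    per_stage = []
    while len(sums) > 1:
        pairs = [(sums[i], sums[i + 1]) for i in range(0, len(sums), 2)]
        per_stage.append(len(pairs))              # butterflies in this stage
        diffs = [r * (a - b) for a, b in pairs]   # finished coefficients
        sums = [r * (a + b) for a, b in pairs]    # passed on to the next stage
        details = diffs + details
    return sums + details, per_stage


if __name__ == "__main__":
    coeffs, per_stage = fast_haar([1, 2, 3, 4, 5, 6, 7, 8])
    print(per_stage, sum(per_stage))   # [4, 2, 1] 7
```

For order 8 the stages use 4, 2 and 1 butterflies, a total of 2^3 - 1 = 7. Because the difference outputs of each stage are already final coefficients, only the sums need to move on to later stages.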

Many other transforms can have similar pipeline algorithms with a
reduced amount of hardware: the Modified generalized discrete transforms
[11], the WFH transforms [1], the Slant Haar transforms [12] and other
generalized Slant transforms [13]. In all cases, the number of butterflies
needed by the pipeline-parallel algorithm to perform a transform of order
2^n in one cycle is the total number of butterflies appearing in the flow
diagram of the algorithm. By contrast, parallel processing requires only
the maximum number of butterflies needed at any stage.
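The two hardware counts can be compared numerically. The following Python sketch assumes the per-stage counts stated in the text (2^(n-1) butterflies at every stage for the FFT or FWT, and a count that halves at each stage for the FHT); the function names and the `kind` labels are hypothetical.

```python
def butterflies_per_stage(n, kind):
    """Butterflies needed at each of the n stages of a transform of order 2^n."""
    if kind == "fft":    # same count applies to the FWT
        return [2 ** (n - 1)] * n
    if kind == "haar":
        return [2 ** (n - k) for k in range(1, n + 1)]
    raise ValueError("unknown transform kind: " + kind)


def hardware(n, kind):
    """Butterfly units needed under each organization."""
    counts = butterflies_per_stage(n, kind)
    return {
        "parallel_only": max(counts),      # one stage at a time
        "pipeline_parallel": sum(counts),  # all stages active at once
    }


if __name__ == "__main__":
    for kind in ("fft", "haar"):
        print(kind, hardware(3, kind))
```

For order 8 this gives 4 versus 12 butterflies for the FFT and 4 versus 7 for the FHT, so the pipeline costs the FHT less than twice the parallel-only hardware.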

FOOTNOTE

*The computation can also be performed "in place" with n2^n storage cells
only, followed by cyclic shifts by 2^n cells.

CAPTIONS 

Fig. 1a : FFT Cooley-Tukey Algorithm of order 4

Fig. 1b : Pipeline FFT Cooley-Tukey Algorithm of order 4

Fig. 2a : Haar matrix of order 8

Fig. 2b : Fast Haar Transform of order 8 

Fig. 3 : Pipeline Fast Haar Transform of order 8. 